CAGEF_services_slide.png

Lecture 07: Flow control


0.1.0 About Introduction to R

Introduction to R is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

The structure of this course is a code-along style - it is 100% hands on! A few hours prior to each lecture, links to the materials will be available for download on Quercus. The teaching materials will consist of a Jupyter Lab Notebook with concepts, comments, instructions, and blank spaces that you will fill out with R by coding along with the instructor. Other teaching materials include an HTML version of the notebook and datasets to import into R - when required. This learning approach will allow you to spend the time coding and not taking notes!

As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark) through DataCamp to help cement and/or extend what you learn each week.

0.1.1 Where is this course headed?

We'll take a blank-slate approach to R here and assume that you know essentially nothing about programming. From the beginning of this course to the end, we want to take you from some potential starting scenarios:

and get you to a point where you can:

data-science-explore.png

0.1.2 How do we get there? Step-by-step.

In the first two lessons, we will talk about the basic data structures and objects in R, get cozy with the RStudio environment, and learn how to get help when you are stuck. Because everyone gets stuck - a lot! Then you will learn how to get your data in and out of R, how to tidy our data (data wrangling), subset and merge data, and generate descriptive statistics. Next will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. After that, we will make all sorts of plots for both data exploration and publication. Lastly, we will learn to write customized functions and apply more advanced statistical tests, which really can save you time and help scale up your analyses.

Draw_an_Owl-2.jpg

The structure of the class is a code-along style: it is fully hands on. At the end of each lecture, the complete notes will be made available in PDF format through the corresponding Quercus module so you don't have to spend your attention taking notes.


0.2.0 Class Objectives

This is the final lecture in a series of seven. Last lecture we explored the realm of statistical analyses with linear regression and other general linear models. Now we arrive at the final destination: control flow, where we will create looping and branching code as well as our own functions. At the end of this session we will have covered:

  1. Control of flow statements.
  2. Combining control flow with useful functions.
  3. Building your own functions in R.
  4. Saving data and your workspace.

0.3.0 A legend for text format in Jupyter markdown

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn R

0.4.0 Lecture and data files used in this course

0.4.1 Weekly Lecture and skeleton files

Each week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need to use your UTORid credentials to complete the login process. From there you will find each week's lecture files in the directory /2021-09-IntroR/Lecture_XX. You will find a partially coded skeleton.ipynb file as well as all of the data files necessary to run the week's lecture.

Alternatively, you can download the Jupyter Notebook (.ipynb) and data files from JupyterHub to your personal computer if you would like to run independently of the JupyterHub.

0.4.2 Live-coding HTML page

A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!

0.4.3 Post-lecture PDFs and Recordings

As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus. A recorded version of the lecture will be made available through the University's MyMedia website and a link will be posted in the Discussion section of Quercus.


0.4.4 Data used in this session

Today we'll be keeping it simple by working with a dataset to help us demonstrate the power of looping and user-defined functions.

0.4.4.1 Dataset 1: compounds_stats.csv

We'll be working with this dataset to help us work through the different aspects of control flow.

0.4.4.2 Source file: lecture07.R

We'll be using this source file later to show how you can save your own functions and import them for data analysis.


0.5.0 Packages used in this lesson

The following packages are used in this lesson:

Some of these packages should already be installed into your Anaconda base from previous lectures. If not, please review that lesson and load these packages. Remember to please install these packages from the conda-forge channel of Anaconda:

conda install -c conda-forge r-biocmanager
conda install -c conda-forge r-gee
conda install -c conda-forge r-multcomp

Then, from within R:

BiocManager::install("limma")


1.0.0 Control flow moves you beyond linear programming

Gru_copy_paste.jpg

Although we have only briefly touched on some of the aspects regarding control flow, it has been implemented behind the scenes in many of the functions you've used throughout this course. From your experience in Jupyter Notebooks, the order in which a code cell's individual statements or instructions are executed can be considered part of control flow. Expanding on this idea, when you see the number order of the code cells, this also indicates the control flow of the entire notebook or program. Once a code cell is run, the objects it has generated remain stored in memory and available for access.

Within our code cells and overall program, control flow can involve statements that generate loops, make conditional choices, and move execution throughout the program. These specific statements allow us to run different blocks of code at different times. This can be accomplished through looping statements (for(), while(), repeat), conditional statements (if(), else, ifelse()), and user-defined functions.

In this lecture, we'll touch on all of these concepts to give you a taste of how you can make your programs accomplish more with less actual code. Let's start by loading up an example dataset to play around with.


1.1.0 Use for() loops to repeat commands for a maximum number of iterations

R doesn't care if you write the same code 1000 times or have the interpreter repeat a single copy 1000 times. However, the second is a lot easier for you. The for() loop helps to reduce code replication by compartmentalizing a set of instructions to repeat instead of copying and pasting the same code several times.

More specifically, a for() loop executes a set of statements repetitively until a well-defined endpoint: the loop ends once its variable has taken on every value in the given sequence.

For example, let's say that we want to add 2 to a and overwrite it every time, 10 times over:

Sure, 10 times is doable by hand, just copy-paste. But what if you need to perform that same task, say 1,000 times? What if the code was more complex than a <- 2? That is when for() loops come to the rescue.
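
As a sketch of where we're headed, the copy-paste task above collapses into a three-line for() loop:

```r
a <- 2

# repeat "add 2 and overwrite a" ten times
for (i in 1:10) {
  a <- a + 2
}

a  # 22
```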


1.1.1 The for loop can be described in three stages

  1. for(x in y): Set a variable to equal the next value in a sequence
  2. { code to run } Run a set of code with that variable at that value
  3. Repeat (hence the loop)

There are a number of ways to set the counting variable within the for() initialization. In reality, you just need to supply a vector of elements for it to iterate through. This could be a sequence where y is defined as a:b, or a numeric vector, or even a vector of objects! Each of these is assigned to x in our loop and must be used appropriately.

Note that without {...} enclosing your code, R will run only the first statement right after the for() call. This can exist on the same line, or on the next line. Subsequent lines, regardless of indentation, will not be run as part of the loop. This behaviour lets you quickly write a one-line for() loop, while {...} lets you extend the loop to accomplish many or more complex tasks.

Let's take a look at the seq() function and how you can use it within a for() loop.
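
For instance, seq() can generate the very vector the loop iterates over - a minimal sketch:

```r
# seq() builds a numeric sequence; the loop visits each value in turn
for (i in seq(from = 2, to = 10, by = 2)) {
  print(i)  # prints 2, 4, 6, 8, 10
}
```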


1.1.2 Common functions are just pre-programmed for() loops

As was mentioned at the start of this section, under the hood, many of the functions that we commonly use are just for() loops. We can easily replicate them with explicit for loops but it takes up extra coding time! For example, we can replicate the rep() function.

Let's duplicate the function of rep() with a for() loop!
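
A minimal sketch of this idea:

```r
x <- 5
result <- x  # declare the variable outside the loop, holding the first copy

for (i in 1:4) {
  result <- c(result, x)  # append another copy on each iteration
}

result             # 5 5 5 5 5
rep(x, times = 5)  # the built-in function gives the same output
```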


1.1.3 Self-referencing variables must be declared outside your loops

Why did we declare result <- x ahead of the for loop? It can get a little complicated, but for our purposes the offending issue lies in the statement inside the loop itself: result <- c(result, x). Remember, when the kernel encounters this command, it tries to evaluate the right side of the assignment first. If result does not yet exist, the lookup fails and the assignment cannot complete. To avoid this, we need to declare result outside the loop.

There are a few ways we could do this such as with result <- NULL just so that it exists as an initialized placeholder. Instead we assigned it initially to hold the first iteration of our sequence. Either would have worked but would require different numbers of loop iterations.

If you declared result <- NULL or result <- x within the loop, that command would repeat with every iteration, resetting result to its initial state each time. Nothing would progress! We'll use this concept to springboard us into the idea of scope.


1.1.4 The scope (persistence) of variables is tied to when/where they are declared

Control flow statements, like other compartmentalized sections of code, can be thought of as separate rooms in a house or sandboxes in a playground.

Thus a variable is either global or local in scope. If it is local, the information about it simply disappears at the end of the function or control flow block. The scope of a variable can usually be considered as between the {...} of a programming section. After you've left that section, anything explicitly declared within it (i.e., new variables from that section) will be released from memory. Of course, R doesn't exactly play by those rules, and stray variables can float in memory. If you want to ensure that variables from something like a for loop remain local, you can use the local() command or create a function().

Lexical and Dynamic scoping: Going even deeper, R and other programming languages implement what are known as dynamic and lexical scoping. When you create functions within functions they can inherit variables based on the context of their creation. This can affect the behaviour of functions when they are used later within your programs. You can find more information on the rules of dynamic and lexical scoping here.

Why is scope important?

Understanding this concept will save you a lot of troubles down the road as you make more and more complex programs. You'll learn to avoid declaring variables in the wrong place, or trying to access ones that no longer exist in your scope. Let's revisit our example from above.


1.1.4.1 The local() scope isolates your code from the global environment
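
A minimal sketch of a local() block of this kind, using a global result variable:

```r
result <- 100  # a global variable

local({
  for (x in 1:3) {
    result <- c(result, x)  # the first lookup falls back to the global value
  }
  print(result)  # 100 1 2 3 (the local copy)
})

result  # still 100 - the global version was never altered
```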

What happened to our variable result? You can see that it was initially declared with the value of 100. When we entered the local() scope and ran the first iteration of our for() loop, the code result <- c(result, x) looked locally first for the values of result and x; since these did not exist locally, it pulled the values from the global environment. Subsequently a local result variable was declared and assigned a value. This local version of result was updated with each iteration, but the global version was never altered.

A similar effect is seen when creating and using your own functions (to be discussed) but you can see that the kernel searches for variables (and functions) in the local namespace before checking the global namespace, followed by the namespaces of the loaded packages.

1.1.5 Cycle through values using a for() loop

The most useful thing to do with a for loop is to cycle through values. Let's return to compounds_data and plot methane per day of each individual row using base R's plotting functions (instead of ggplot).


1.1.6 Iterating through a vector of elements in a for() loop

Another handy feature of the for() loop in R is being able to directly give the loop a vector to iterate through until there are no elements left. This will come in handy when applying the same transformations, functions, or calculations on different subsets or elements within a vector.

We'll start with a simple example of looping through a small character vector.
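
A minimal sketch (the salinity labels here mirror the course dataset's groups):

```r
salinities <- c("brackish", "fresh", "saline")

# the loop variable takes on each element of the vector in turn
for (salinity in salinities) {
  print(paste("Now processing:", salinity))
}
```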


Let's use a t.test() to look for methane production differences between the salinity levels of every day ending in a 2 (starting at day 2), excluding the "brackish" salinity group.


1.1.7 Application of for loops through replication of tapply()

Let's replicate the tapply() function although we don't need it to have the same formatting. tapply() applies "a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors". We'll use it on our dataset compounds_data and save the result in compound_mean.


Now that we see how tapply() works let's mimic this function using a for loop. Note that tapply() returned an array to us but we'll save our results in a data frame since we're more familiar with these.
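
Since the compounds_data file itself isn't reproduced in these notes, here is a sketch with a toy stand-in (demo_data) that assumes numeric methane values grouped by salinity:

```r
# toy stand-in for compounds_data
demo_data <- data.frame(
  methane  = c(1, 3, 2, 4, 6, 8),
  salinity = rep(c("fresh", "saline"), each = 3)
)

tapply(demo_data$methane, demo_data$salinity, mean)  # fresh 2, saline 6

# the same result built with an explicit for() loop, stored as a data frame
compound_mean <- data.frame(salinity = character(0), mean_methane = numeric(0))
for (s in unique(demo_data$salinity)) {
  m <- mean(demo_data$methane[demo_data$salinity == s])
  compound_mean <- rbind(compound_mean,
                         data.frame(salinity = s, mean_methane = m))
}
compound_mean
```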


1.2.0 Generate conditional branches using if() statements

left_exit_dropday.jpg

One of the big advantages of programming is to have conditional statements in your code. R can make binary decisions like "if data meets a condition, do this". Some of these happen implicitly as in a for() loop, but you can also declare these decision branches explicitly.

The if() statement evaluates a conditional expression that is either TRUE or FALSE. The general format is

if (boolean_expression) {
   # statement(s) will execute if the boolean expression is TRUE
}
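
A concrete instance of this structure:

```r
x <- 12

if (x > 10) {
  print("x is greater than 10")  # runs because the expression is TRUE
}
```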

1.2.1 More complex conditional branches may require the else() statement

Now that we know how to use if() statements, what if we want to give a second instruction based on the outcome of the if() statement? The else and else if statements exist to extend the conditional branch through additional considerations. In general, the structure looks like this:

if (boolean_expression_1) {
   # statement(s) will execute if boolean_expression_1 is TRUE
} else if (boolean_expression_2) {
   # statement(s) will execute if boolean_expression_2 is TRUE
} else {
   # statement(s) will execute if none of the above expressions were TRUE
}
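
A concrete instance with all three branches:

```r
score <- 65

if (score >= 80) {
  grade <- "A"
} else if (score >= 60) {
  grade <- "B"        # this branch runs: 65 < 80 but 65 >= 60
} else {
  grade <- "C"
}

grade  # "B"
```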

1.2.2 Use if() statements to generate system messages

If/else statements can also be used to perform system-wide tasks, like generating a warning or halting execution. For example, if we are writing a file to a directory and there is already a file with the same name, we should generate a warning or simply stop. Without the warning, the existing file will be silently overwritten.
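
A sketch of this pattern, with a hypothetical output file name and file.exists() as the test:

```r
out_file <- "compound_summary.csv"  # hypothetical file name

if (file.exists(out_file)) {
  print("Warning: a file with this name already exists!")
} else {
  print("No file with this name; safe to write.")
}

write.csv(data.frame(a = 1:3), out_file)  # note: this runs no matter what!
```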


Challenge: Is there a cleaner way to produce our conditional?

1.2.2.1 Use effective control flow to ensure your intentions are met

Despite the warning in our code, the file in our example would still be overwritten. The call to write.csv() is outside the control flow of the conditional if()/else. To fulfill our true intentions, we should move the write.csv() call so that it is under the direct influence of the control flow.
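
A sketch of the corrected placement, again with a hypothetical file name:

```r
out_file <- "compound_summary.csv"  # hypothetical file name

if (file.exists(out_file)) {
  print("Warning: a file with this name already exists; not writing.")
} else {
  write.csv(data.frame(a = 1:3), out_file)  # only runs when no file exists
}
```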


1.2.3 The ifelse() is an effective control flow statement for simple tasks

As we've seen a couple of times in lecture now, rather than building a large control flow block for simple tasks, we can use the ifelse() function to contain the conditional test and both outcomes in one call. This is a much more powerful function than it appears to be, as you can supply vectors to it as well!

ifelse(boolean_expression_vector, true_outcome_vector, false_outcome_vector)

Watch out for vector recycling! It's convenient for re-assigning values across vectors, but note that we aren't performing any complex actions or responses - just assigning outcomes/values based on our evaluation expression.
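
A minimal vectorized sketch (the variable name is illustrative):

```r
methane_levels <- c(0.2, 1.5, 0.8, 2.3)

# one call evaluates the whole vector element by element
ifelse(methane_levels > 1, "high", "low")  # "low" "high" "low" "high"
```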


1.3.0 Running loops without a predetermined end-point

There may be instances where you need to run loops on data until you find a certain piece of information, or until a specific condition is met rather than examining all of the elements within a set. There are two ways you can accomplish these "open-ended" loops.

1.3.1 while() loops run conditionally

Unlike a for() loop, which executes for a predetermined number of iterations, the while() loop executes its commands as long as a conditional expression continues to evaluate as TRUE at each iteration. This conditional expression must also evaluate as TRUE for execution to begin at all. The while() loop can be thought of as a special implementation of an if() statement that repeats over and over again until the conditional fails.

Let's work with some simple examples.
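
A minimal sketch:

```r
i <- 1

while (i <= 3) {  # the condition is re-checked before every iteration
  print(i)
  i <- i + 1      # without this line, the condition would never change
}

i  # 4
```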


1.3.1.1 Conditional loops can become endless

When programming a conditional loop you must always include a statement that alters the condition or breaks out of (coming up) the loop itself. It's also important to note where you place the statement that alters the condition within your loop. All the command statements within the loop, unless otherwise specified, will execute before the re-evaluation of the conditional statement.

For example, a programmer is assigned a task: "While you're at the grocery store, buy some eggs". The programmer never came back home.
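
Avoiding the programmer's fate means making sure the exit condition can actually become FALSE - a playful sketch:

```r
eggs <- 0

while (eggs < 12) {  # "while you're at the store"... the exit condition
  eggs <- eggs + 1   # ...must actually move toward becoming FALSE
}

eggs  # 12 - we made it home
```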


1.3.2 Using next and break to interrupt any kind of looping structure

The explicit use of the next and break commands will interrupt the current iteration of a looping structure, but each differs in what it does afterwards: next skips ahead to the next iteration, while break exits the loop entirely.

Let's use the following examples to see how these mechanisms work.
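
A minimal sketch showing both in one loop:

```r
for (i in 1:10) {
  if (i %% 2 == 0) next  # skip even values; jump to the next iteration
  if (i > 7) break       # leave the loop entirely
  print(i)               # prints 1, 3, 5, 7
}
```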


1.3.3 repeat loops run endlessly unless specifically interrupted by break

Unlike the while loop, which ends once its condition evaluates as FALSE, a repeat loop has no explicit conditional statement built into its structure. Instead, it will continue to repeat until it is exited via the break command.
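
A minimal sketch:

```r
i <- 1

repeat {
  print(i)
  i <- i + 1
  if (i > 3) break  # without this check, the loop would run forever
}
```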


1.3.4 Be mindful of how you iterate through your loops

Depending on the order in which you set up your conditionals, you may accidentally produce unexpected issues. It is best to consider the order in which you want to accomplish tasks within your loops before beginning the next iteration. This is especially relevant in the case of a conditional loop (while() or repeat) where you must include a variable that can eventually meet the conditions for exit.

Loops don't care about you! Although loops and other control flow structures can vastly simplify our code, you must remember they are agnostic to your intentions. These structures have a very specific purpose and design so to program successfully with these, we need to understand their inner workings. Take the time to visually and mentally test your code using a series of base cases by asking yourself what input and output should look like: before the first iteration, after the first iteration, in the middle of your dataset, in your penultimate iteration, in your final iteration. Quickly assessing these on a small test set can also help you identify potential problems!

1.3.4.1 Use loops to simplify your code, but don't re-invent the wheel!

Depending on the task you are working on, there may already be a function that satisfies your need, so you don't have to write explicit for() loops. Make use of existing functions whenever you can, because they have already been optimized to be fast and efficient.

Taking advantage of functions can allow you to keep your code clean rather than programming for loops to generate a simple number pattern.

Use R's vectorized functions: Many of the base R functions we've seen over the span of this course work well on vectors. In fact, these functions are optimized to work on these data structures and you should take advantage of this. Often, completing the same operations in something like a for() loop can take much longer. While not apparent on small datasets, you can begin to see the consequences of your choices on much larger ones. Here are some resources that highlight this efficient option.
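
A quick side-by-side sketch of the two approaches:

```r
x <- 1:10000

# explicit loop: one addition per iteration, interpreted each time
total <- 0
for (v in x) {
  total <- total + v^2
}

# vectorized equivalent: a single call backed by optimized compiled code
sum(x^2)  # same answer, much faster on large vectors
```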

2.0.0 Increasing our complexity by combining for() loops with ggplot()

Let's say we are now ready to start making some plots for our manuscript, and we want to make individual plots for each salinity. The code below makes one plot for each salinity level, i.e. brackish, fresh, or saline, depending on which salinity you pass to the code.


But what if I were to have, say, 25 salinity levels? In this case, a for loop will be the way to go. Take a look at the following code:
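
A hedged sketch of such a loop (assuming ggplot2 is loaded; since compounds_data's exact columns aren't reproduced in these notes, a toy demo_data with day, methane, and salinity columns stands in):

```r
library(ggplot2)

# toy stand-in for compounds_data
demo_data <- data.frame(
  day      = rep(1:5, times = 3),
  methane  = runif(15),
  salinity = rep(c("brackish", "fresh", "saline"), each = 5)
)

# one plot file per salinity level, however many levels there are
for (salinity.val in unique(demo_data$salinity)) {
  p <- ggplot(subset(demo_data, salinity == salinity.val),
              aes(x = day, y = methane)) +
    geom_point() +
    ggtitle(paste("Methane production -", salinity.val, "water"))

  # the loop variable also builds a unique file name for each plot
  ggsave(paste0("methane_", salinity.val, ".png"), plot = p)
}
```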


2.1.0 Take advantage of for loop variables to customize output in each loop

From above you can see that we can take advantage of our incrementing variables within the for loop. We can use them to help subset data and generate titles and file names. You can use them in combination with other control statements to update the image as well! Just remember to avoid generating errors within your for() loop when accessing or altering data. Ensure you aren't trying to reference or alter data or subsets that do not exist due to missing information in your original datasets.

What if I want one plot for each compound and its sterile control in one page by salinity?


3.0.0 Can we jump around our code to perform different tasks?

Yes! So far we've covered many ways of control flow but all of our programs have been moving in a linear direction from start to end. That is also just a consequence of working with a Jupyter notebook. Programs, however, are not necessarily run in a linear fashion.

What if you need to perform a set of similar instructions multiple times, at multiple points within your control flow? Perhaps it's even the same kind of for() loop on different sets of data? There are a lot of tricks like nested loops but you're better off knowing how to make functions that can be used in other code as well!

The general structure of a script or program can be divided into:

  1. Global/environmental variables and declarations
    • Describe your script and assumptions
    • Import your libraries
    • Declare any global variables
  2. Main program
    • The place where the main statements occur.
    • It may also be a function call with specific arguments like the location of data files.
    • Reading through your annotations, someone else should be able to discern what your program is doing.
  3. Helper functions or subroutines
    • Here you can create functions or "mini" programs that do work for you.
    • They can be called from anywhere within the program (once loaded into memory).
    • Repetitive tasks whose output varies only based on the input provided.
    • Subroutines may work together or call on each other to accomplish a greater task.
    • Functions that you use often can be placed into their own files for importing like a package.

3.0.1 Do One Thing - but do it well

A best practice when writing functions is the "Do One Thing" principle: each function should do one thing - one task. Instead of one big function, you can write several small ones, one per task, without going to the other extreme of fragmenting your code into a ridiculous number of tiny snippets. By doing one thing, your functions become easier to read, test, debug, and reuse.

Time to start writing our own functions.

3.0.2 Document your functions

While we have been using help() and ? to look up documentation on the various functions we've been using, our user-defined functions will not have any kind of accessible documentation. Of course, if we were making specific packages for R, we could create accessible documentation.

Regardless of this limitation, it is best practice to document your functions much like you document the rest of your code. You can include information such as the purpose of the function, its expected inputs and their types, any default values, and what it returns.


3.1.0 Declare your own functions with function()

In R, a function is declared with the following syntax:

function_name = function(parameter1_name, parameter2_name, ..., parameterN_name = preset_value) {
    # The specific code of your function goes within the {...}

    return(output)
}
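
A tiny concrete instance of this syntax (an invented example, not from the course dataset):

```r
# convert a temperature in Celsius to Fahrenheit
c_to_f <- function(celsius) {
  fahrenheit <- celsius * 9 / 5 + 32
  return(fahrenheit)
}

c_to_f(100)  # 212
```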

Let's convert our plotting code from above into a simple function!


3.1.1 Once your declared functions are stored in memory, they can be called from anywhere.

Now that our subroutine is stored in memory, it can be called whenever we want - perhaps on different data sets, as long as they meet the requirements set out in our description of the function itself. You can even build upon it to use control flow to decide if it will be faceted or not. The code between the two versions is so similar, you could break it into an if statement.

pokemon_choose_function.jpg

Let's try to use it right now.


3.2.0 Retrieve data from your function using the return() command

Some of your functions may generate subsets of data or results that you would like to further investigate for analysis. For example, when we generate our plots, perhaps we would like to also retrieve information like where the file was saved, along with the subset of data for each.

Using the return() command has two consequences:

  1. It will terminate or exit the function currently running once this command is called.
  2. It will return a single object that will be assigned to a variable or be displayed to the console if unassigned.

A special note about the returned object: it can be any kind of object, and if you want to return multiple objects, put them in a list()! Let's update our function.
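
A minimal sketch of returning several results bundled in a list (a hypothetical helper, not the course's plotting function):

```r
# summarize a numeric vector, returning several results at once
summarize_vec <- function(x) {
  return(list(mean = mean(x), sd = sd(x), n = length(x)))
}

res <- summarize_vec(c(2, 4, 6))
res$mean  # 4
res$n     # 3
```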

3.2.1 You can nest functions within functions

One of the things you can do as your functions and needs become more complex is to nest functions within other functions. We've already applied this when we call ggplot() functions within save.facet.plot().


3.3.0 Arguments for your functions can have a default value

The last helpful part of making functions is to consider providing default values for some of your arguments. In some cases you may have a subset of datasets that need to be treated differently, so including an argument for your function to toggle certain behaviours is helpful. Including these arguments, however, means you have to define them every time you call the function - unless you assign a default value. Default values are only overridden by supplied arguments; otherwise they are applied within your function.

Let's update our save.facet.plot() one last time to include a default salinity.
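
A generic sketch of the default-value mechanism first (the function here is invented for illustration):

```r
# 'digits' has a default, so it is optional at call time
round_mean <- function(x, digits = 2) {
  round(mean(x), digits)
}

round_mean(c(1, 2, 2))     # uses the default: 1.67
round_mean(c(1, 2, 2), 0)  # overrides the default: 2
```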


3.4.0 User-defined functions can also define functions

While a rarer occurrence, your user-defined functions can be used to instantiate and return a function itself. In these cases, the scoping of your variables can become a little trickier, but variables within your code can be set using parameters from the initial function.

Let's start with a simple example before we return to our plot-saving function.
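
A minimal function-factory sketch:

```r
# a function that builds and returns another function
make_multiplier <- function(factor) {
  function(x) x * factor  # 'factor' is remembered from the enclosing call
}

double <- make_multiplier(2)
triple <- make_multiplier(3)
double(5)  # 10
triple(5)  # 15
```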

Now let's revisit our plot-saving function. We'll make a new plot-setting function that we can use to permanently set the data frame that is used when making plots. We can initialize this newly set function and save it as the function make.compound.plot().


3.5.0 The stop() function exits a function with a message

Sometimes you might produce a function that could fail at a number of points for various reasons. While the R-kernel may simply produce a warning and proceed, you may wish to stop the function wherever it is rather than proceeding. Using the stop() function can help produce "controlled" error stopping points in your program. You can also include an optional message that will help to clarify why you have stopped the function.

First, however, let's produce a simple example of using the stop() function.
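
A minimal sketch (the function is invented for illustration):

```r
check_positive <- function(x) {
  if (x <= 0) {
    stop("x must be positive")  # exits the function with an error message
  }
  sqrt(x)
}

check_positive(16)  # 4
# check_positive(-1)  # would halt with: "x must be positive"
```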


Suppose we aren't interested in producing -Inf or NaN values? We can build a wrapper around the log10 function with some conditional branching inside it.
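
One way such a wrapper might look (the wrapper name is an assumption, not the course's exact code):

```r
safe_log10 <- function(x) {
  if (x < 0)  stop("input is negative; log10 would return NaN")
  if (x == 0) stop("input is zero; log10 would return -Inf")
  log10(x)
}

safe_log10(1000)  # 3
# safe_log10(0)   # would halt instead of silently returning -Inf
```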


3.6.0 Use tryCatch() to identify errors without stopping

In our above example of stop(), using it halts the execution of our code. Sometimes, instead, we may wish to note that an error has occurred but still proceed with the remainder of the code. In that case you can use the tryCatch() function, which takes on a somewhat complex structure.

The tryCatch() function can be used to run an expression (or lines of code) and, if an error or warning is produced, catch the result without halting your program's execution. Additional message information can be produced in each case so that the user can be warned of potential issues. Using tryCatch() takes the form of:

func_name = function(input) {
    out <- tryCatch({
        ## This is where we try code that might fail
        expression(s)
    },

    warning = function(condition) {
        ## statements to execute upon a warning
        message("Optional consolidated warning message")
        return()  # optional return value
    },

    error = function(condition) {
        ## statements to execute upon an error
        message("Optional consolidated error message")
        return()  # optional return value
    },

    finally = {
        ## code to complete regardless of an error
    })  ## end of tryCatch

    return(out)
}

3.6.1 Remember your functions should do one thing well

Let's focus again on the plotting functions we produced. Previously our versions of save.facet.plot() included steps where the input was being filtered - sometimes by sub-functions that should just be producing a plot object. To remedy this we'll go back to our rule of "Do One Thing" and we'll generate make.facet.plot() so that its sole purpose is to produce a plot when given salinity.data and a salinity.val.


3.6.2 Call on subfunctions within your function to simplify debugging

Next we want to generate a second function that will be able to filter a set like compounds_data, call on make.facet.plot(), and save the results as needed. In doing so we simplify the debugging process and it will help when we begin to incorporate a tryCatch() structure into our code.


3.6.3 Test the boundary cases of your function

Here's where we need to get creative. What would happen inside save.facet.plot() if we happened to forget to supply a salinity.val parameter to our call? Previously we included a default value like "brackish", but we have not done so here. Using a call like save.facet.plot(compounds_data) will produce an error.


3.6.4 Implement a tryCatch() series to try and capture your error

Instead of allowing the execution to halt when we reach an error maybe we can produce some messages and return a null value? In this implementation we will return a NULL value for the user to deal with.
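
The pattern, sketched on a small invented function rather than the plotting code:

```r
safe_divide <- function(x, y) {
  tryCatch({
    if (y == 0) stop("division by zero")
    x / y
  },
  error = function(condition) {
    message("Returning NULL: ", conditionMessage(condition))
    return(NULL)  # hand a NULL back instead of halting
  })
}

safe_divide(10, 2)  # 5
safe_divide(10, 0)  # NULL, with a message explaining why
```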


3.6.5 Use tryCatch() to set values within your function

Suppose that instead of just returning NULL when we produce an error, we change values on the user's behalf and continue? Of course, our example here is in the context of an expected error, and we can't always account for the nature of the error(s) we'll encounter. You could make things more complex and try to program some statements to determine the error type!

In our example, we'll anticipate the issue of a missing salinity value and assume that will be our only problem. We'll take advantage of the <<- scoping assignment operator: it searches up the hierarchy of enclosing environments until it finds the specified variable to assign to. This happens in place of R assigning a new local variable.

Let's modify our save.facet.plot() function so that the error handler can set the salinity.val variable within save.facet.plot().
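A sketch of that modification; the "brackish" fallback mirrors the default we used previously, and forcing salinity.val inside the try block is what triggers the handler when the argument is missing:

```r
save.facet.plot <- function(compounds.data, salinity.val) {
  tryCatch(
    salinity.val,  # forcing the argument errors if it was never supplied
    error = function(e) {
      message("No salinity.val supplied; falling back to 'brackish'")
      salinity.val <<- "brackish"  # assign in the enclosing (function) scope
    }
  )
  plot.data <- dplyr::filter(compounds.data, Salinity == salinity.val)
  out <- make.facet.plot(plot.data, salinity.val)
  ggplot2::ggsave(paste0("facet_plot_", salinity.val, ".png"), plot = out)
  return(out)
}
```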


Here's an alternative version of our code that runs all of the code within the tryCatch call using the finally option.
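One way to sketch that alternative, with the remaining work moved into finally, which runs whether or not the error handler fired:

```r
save.facet.plot <- function(compounds.data, salinity.val) {
  tryCatch(
    salinity.val,  # errors here if the argument is missing
    error = function(e) {
      message("No salinity.val supplied; falling back to 'brackish'")
      salinity.val <<- "brackish"
    },
    finally = {
      # Runs regardless of whether an error was caught above
      plot.data <- dplyr::filter(compounds.data, Salinity == salinity.val)
      out <- make.facet.plot(plot.data, salinity.val)
      ggplot2::ggsave(paste0("facet_plot_", salinity.val, ".png"), plot = out)
    }
  )
  return(out)
}
```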

Error-catching for the data scientist: While error-catching can seem complicated, it can save all sorts of headaches when debugging your code. As your code grows in complexity, you may want to learn more about these systems. You can find a well-written section on debugging and error-handling by Hadley Wickham, along with some helpful examples.

Now that you have the basics, you can continue to build on complexity (or simplicity) as you need it.

coding_mistakes.jpg


4.0.0 Taking full advantage of your R environment

While working within the R environment, we've learned to manipulate data and save its output as text or Excel files. We've also learned to write our own functions and save output as variables. When we create particularly useful functions and want to keep the code, there is no need to copy and paste it into every script we write.

In this last section we will discover how we can import our own functions, save data objects, and load R workspaces into memory.

4.1.0 Keep all of your helper functions and subroutines in a file you can source()

As a final extension of our control flow lesson, recall that you already know about packages - collections of functions and data pre-made by others within the R community.

You don't need to build entire packages of your own, but you can certainly make source files to keep functions and pertinent variables you re-use across your analyses.

To run a saved ".R" file which contains purely code, you can use the source() command. Let's try!
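For example, if you had saved your helpers to a file (the filename here is hypothetical):

```r
# Run every line of code in the file, loading its functions
# and variables into the current environment
source("my_helper_functions.R")
```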


4.2.0 Query your environment with ls() to find variables and functions

After loading your script into memory, you may want to see what is available in your environment. The ls() command lists everything that is available, but it does not discriminate between data objects and functions.
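For example (the object names shown are just the ones from our earlier sketches):

```r
ls()
# Returns a character vector of every name in the global environment,
# e.g. "compounds_data" "make.facet.plot" "save.facet.plot"
```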


4.2.1 Check functions loaded into memory with lsf.str()

As you can see from above, using ls() returns all of the objects currently saved in memory, but also the functions we've previously declared and possibly some new ones imported by our call to source(). To list only the functions we have loaded (outside of those attached from packages), we can use lsf.str(). Let's see what's new and try something out.
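For example:

```r
lsf.str()  # lists only the functions defined in the global environment,
           # each with a summary of its arguments
```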


4.3.0 save() objects or your whole kernel memory!

From time to time you may have objects from analyses that aren't easily translated back into data tables or Excel files. Perhaps you want to save objects or plots from a complex analysis for later use. You can accomplish this with the save() command by providing one or more objects to save.
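For example, saving a data object and a function together (the object names and filename follow our earlier sketches):

```r
# Save selected objects to a single .RData file
save(compounds_data, make.facet.plot, file = "analysis_objects.RData")
```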


4.3.1 save.image() saves your entire workspace

Sometimes you just want to save everything in memory. This can be a safeguard against accidental errors after running long analyses. The same can be said for saving single objects, but you may find this a useful command in the future.
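For example (the filename is hypothetical):

```r
# Save every object currently in memory to one file
save.image(file = "full_workspace.RData")
```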


4.4.0 load() .RData files into memory

When you're finally ready to revisit your saved objects or workspace, you'll want to restore them. It's as easy as using the load() command. Let's demonstrate, but first we need to clean up our current memory with rm().
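The clean-up-and-restore cycle looks like this (the filename matches the hypothetical one used with save() above):

```r
rm(list = ls())                  # clear everything currently in memory
ls()                             # character(0) - an empty environment
load("analysis_objects.RData")   # restore the saved objects by name
ls()                             # the saved objects are back
```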


5.0.0 There is always more to explore!

Everything_right.jpg

Let's review our time together. Over the span of this course we've discussed:

  1. Basic data types, objects, and classes in R
  2. Data manipulation with the dplyr package
  3. Principles of tidy data using the tidyverse package
  4. The grammar of graphics with the ggplot2 package
  5. Regular expressions and string manipulation with stringr
  6. Linear modeling and data analysis with methods like ANOVA
  7. Control flow through looping and functions

You now have the tools to accomplish quite a few tasks and the foundation to grow your skills as needed. Let's run a final function together to celebrate!


5.1.0 Post-course survey

There is no post-lecture assessment this week. Your DataCamp accounts will remain active for another ~4 months, during which time you can explore the site's different courses. Please take advantage of this opportunity to keep growing your R skills!

However, we have created a post-course survey you can fill out anonymously. You can use this survey as an opportunity to tell us about your experience and help shape the future offerings of this series. Please take 5-10 minutes to fill out the survey. We really appreciate your feedback!

https://docs.google.com/forms/d/e/1FAIpQLSc6Q5QT_pVoDHYglHxJzydOyV4ZyHkJ_Yt5WKOVimEmmYa1tw/viewform?usp=sf_link


squid_term_project.jpg


5.2.0 Final assignment guidelines (28% of final grade)

Your final project is due two weeks after this lecture, at 13:59 on Thursday, November 11th. Please submit your final assignment as a single compressed file, which will include:

  1. Your Jupyter Notebook final project
  2. A PDF version of the notebook with all output from the code cells. This will be used for the markup and comments I return to you about your project.
  3. Any associated data needed to run your project. When you create your compressed file for submission, you can preserve the folder structure by compressing the entire folder with the needed files.

Please refer to the marking rubric found in this course's root directory on JupyterHub for additional instructions.

You can build your Jupyter Notebooks on the UofT JupyterHub and save/download the files to your personal computer for compressing before submitting on Quercus.

JupyterHub.SaveDir.png

Any additional questions can be emailed to me or the TAs or posted to the Discussion section of Quercus. Best of luck!


5.3.0 Acknowledgements

Revision 1.0.0: materials prepared in R Markdown by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.1.0: edited and prepared in Jupyter Notebook by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


5.4.0 Your DataCamp academic subscription

This class is supported by DataCamp, the most intuitive learning platform for data science and analytics. Learn any time, anywhere and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 350+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 6 million learners around the world and close your skills gap.

Your DataCamp academic subscription grants you free access to DataCamp's catalog for 6 months from the beginning of this course. You are free to look for additional tutorials and courses to help grow your skills for your data science journey. Learn more (literally!) at DataCamp.com.

DataCampLogo.png


5.5.0 Resources

CAGEF_new.png